This project is made to be read in html, so open the html file in your preferred webbrowser. As standard the code is hidden in this document, but you can show all by pressing the button “Code” in the top right of the document. You can also show individual chunks of code by pressing the buttons “Code” which are placed around in the document.
Link for google colab:
Link for github: https://github.com/DataEconomistDK/M2-Group-Assignment
In this project we will work with a dataset of 5.000 consumer reviews for a few Amazon electronic products like f. ex. Kindle. Data is collected between September 2017 and October 2018. This is a sample taken from Kaggle which is a part of a much bigger dataset available trough Datafiniti. The data can be collected from this link: https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products?fbclid=IwAR1o_blPfHeBPmnUzAOW7Ct24L7fhbI3OGcbfaVgaDZENhVXwaCP4godKvQ#Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv
Note there is 3 available dataset on kaggle, but the file used here is called “Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products”. The file is downloaded as is, and imported further below.
First i have some personal setup in my local R-Markdown on how i want to display warnings ect. And then i load my packages.
Now we load the data we downloaded from kaggle. From this file we select the following variables:
id: An id number given to each review created by us corrensponding to the row number of the raw data.
name: The full name of the product
reviews.rating: The rating of the product on a scale from 1-5.
reviews.title: The title of the review, given by the customer.
reviews.text: The review text written by the customer.
data_raw <- read_csv("Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv") %>%
select(name, reviews.rating, reviews.text, reviews.title) %>%
mutate(id = row_number())
As the data is very raw and messy we now do some cleaning. We remove everything that is not normal letters, such as punctuations, numbers, special characters ect, and changing all strings to lower case in the review text.
We will also do some lemmatization. The purpose of this is to not only analyze the exact word strings in the reviews, as this would include several possible forms of the words used. F. ex. think and thought. Instead we want to merge all possible forms of a word into it’s root word. Lemmatization try and do so, by using detailed dictionaries which the algorithm looks trough to link a given word string back to it’s root word. This is a more advanced method than stemming and should be beneficial in this report.
We here want to primarily work with tidy text, where there is one token per row. So new a clean and filtered dataset is created both with tokens and as normal dataframe with the review text.
tokens_clean <- data_raw %>%
unnest_tokens(word, reviews.text, to_lower = TRUE) %>%
mutate(word = word %>% str_remove_all("[^a-zA-Z]")) %>%
filter(str_length(word) > 0) %>%
mutate(word = lemmatize_words(word))
reviewtext_lemma <- tokens_clean %>%
group_by(id) %>%
summarize(reviews.text = str_c(word, collapse = " ")) %>%
ungroup() %>%
select(reviews.text) %>%
as_vector()
data_clean <- data_raw %>%
mutate(reviews.text = reviewtext_lemma)
We now have 153.994 tokens, in their each seperate rows in the tokens dataset. By doing lemmatization the number of unique tokens are reduced from around 6000 to around 4600 words, which should prove quite beneficial.
In this assignment we want to use network analysis to gain new insights into how the reviews are structured. Here we extract bigrams from each review text, clean and prepare them to then create networks. Where we before considered tokens as individual words, we can create them as n-grams that are a consecutive sequence of words. Bigrams are n-grams with a length of 2 consecutive words. This can be used to gain context and connection between words.
Bigrams are now created, by unnesting the tokens.
bigrams <- data_clean %>%
unnest_tokens(bigram, reviews.text, token = "ngrams", n = 2) # n is the number of words to consider in each n-gram.
bigrams$bigram[1:2]
## [1] "the display" "display be"
Remember that each bigram overlap, as can be seen from above, so that the first token is “the display” and the second is “display is”. Now the most common bigrams are displayed.
#Counting common bigrams
bigrams %>%
count(bigram, sort = TRUE)
Notice the most common bigrams are: “for my”, “easy to”, “to use”, “it is”. These are mostly stopwords, which is not very usefull for the analysis. To remove these from the bigrams, we now split the bigram into 2 columns word1 and word2, and then filter them away if either of them is a stopword. The stopwords are taken from a dictionary called stop_words. Now we make a new a new count to see the most bigrams after filtering.
bigrams_separated <- bigrams %>%
separate(bigram,c("word1","word2"),sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
#New bigram counts
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_counts
Above we can see that the most common bigrams are now mostly product names such as “kindle fire”, “battery life”, “amazon fire”, “amazon echo” ect. We now combine the 2 columns again into a single column with the bigram, to do further analysis. This is done by using the ‘tidyr’ function ‘unite’. The purpose is to treat the bigram as a ‘term in a document’.
The interesting thing is now to visualize the relationship between all words. To this we will use the package igraph. Before doing this we will need to create the graph from a data frame of the bigrams. Here nodes are the words, and the edges correspond to the connection between the two words in the bigram. The first word in the bigram is the column ‘from’, and the second word is ‘to’, and it’s therefore a directed network. The edges are given a weight corresponding to how many times it occures in the total amount of reviews called ‘n’. The weight is plotted as the alpha value, so more frequent bigrams have a darker colour, and vice versa. Only bigrams that occure more than 15 times are plotted in this network, as it otherwise would get to messy.
set.seed(123)
bigram_graph <- bigram_counts %>%
filter(n > 15) %>% #The occurence of the bigram is more than 15.
graph_from_data_frame()
a<- grid::arrow(type = "closed",length = unit(.15,"inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.05,'inches'))+
geom_node_point(color = "pink", size = 3) +
geom_node_text(aes(label = name),vjust=1,hjust=1) +
theme_void()
The plot above, give us some insights about the connection of words in the reviews. If we where to chose a random word in the graph, the most likely word to come afterwards would be the outgoing connection with the darkest colour. This way we can kinda predict what words that come next. Remember that the words have been lemmatized, so it shows the root word so the sentence created would not be grammatical correct, but would still carry the meaning as a whole.
We see many small connections such as customer -> service, sound -> quality, black -> friday and ect. Then we also have a bigger cluster where love is one of the key words. Many words such as kid, daughter, son, wife ect. point in the direction of love, and then outgoing edges from love is play, watch, alexa. Creating sentences such as “wife love alexa” or “kid love play”. So first we have the person, then the sentiment word love, and then the action they do or what they love. We see that amazon is a central word with many outgoing connections, as many things are called “amazon prime”, “amazon account” ect. Other key nodes are the product names such as “fire”, “kindle”, “hue”.
#Correlation bigrams
bigram_section <- tokens_clean %>%
filter(!word %in% stop_words$word)
word_pairs <- bigram_section %>%
pairwise_count(word, id, sort = TRUE)
word_pairs
word_cors <- bigram_section %>%
group_by(word) %>%
filter(n()>= 20) %>%
pairwise_cor(word,id,sort=TRUE)
word_cors
Maybe choose other names
word_cors %>%
filter(item1 %in% c("kindle","fire")) %>%
group_by(item1) %>%
top_n(6) %>%
ungroup() %>%
mutate(item2 = reorder(item2, correlation)) %>%
ggplot(aes(item2, correlation)) +
geom_bar(stat = "identity") +
facet_wrap(~ item1, scales = "free") +
coord_flip()
Maybe add colour scale.
word_cors %>%
filter(correlation > .275) %>%
graph_from_data_frame() %>%
ggraph(layout="fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = TRUE) +
geom_node_point(color = "pink",size=3) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
In this section we will analyze the xxxx.
Up until now special characters, numbers and special letter have been removed and the tokens have been unnested. Words have also been through a lemmatazation. We will start to look at the top 100 words.
tokens_clean %>% count(word, sort=TRUE) %>% head(100)
Before looking for our own stopwords, we will move all stopwords build into the package, tidytext, called SMART.
tokens_clean %>% anti_join(stop_words)
After that we will look throuh the tokens_clean dataframe again and remove our own stopwords, where we decied to remove these five stopwords.
own_stopwords <- tibble(word= c("im", "ive", "dont", "doesnt", "didnt"),
lexicon = "OWN")
Now we will remove out own stopwords. Afterwards we will filter first for ndoc, which is the total number of words in the document. Here we say that documents, here reviws, with less than five words in them.
tokens_stemmed <- tokens_clean %>%
anti_join(stop_words %>% bind_rows(own_stopwords), by = "word")
tokens_stemmed <- tokens_stemmed%>% add_count(id, name = "ndoc") %>% filter(ndoc > 5) %>% select(-ndoc)
We will now again look at the top words and again plot them.
topwords <- tokens_stemmed %>% count(word, sort=TRUE)
topwords %>%
top_n(20, n) %>%
ggplot(aes(x = word %>% fct_reorder(n), y = n)) +
geom_col() +
coord_flip() +
labs(title = "Word Counts",
x = "Frequency",
y = "Top Words")
And now we will look at a wordcloud for the top 50 words. So pretty.
wordcloud(topwords$word, topwords$n, random.order = FALSE,
max.words = 50, colors = brewer.pal(8,"Dark2"))
Up untill now, equal weight have been given to all words, but some are more rare than others. Term frequency–inverse document frequency or just tf-idf, is a way to analyze how important a word is to a document in a corpus:
\[\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)\] Here tf is the term-frequency and idf is the inverse document-frequency, a coefficient which is larger whenever the particular term is found in a lesser number of documents.
We tried to run a tf-idf analysis but we couldn’t really say anything from the analysis, probably because there’s a lot of documents. Every person has their own dictionary and a lot of words may appear very rare, and therefor they may be giving a high idf coefficient, which is why their tf-idf is high. If we were analyzing a number of books, the analyses may have made more sense.
Sentiment analysis refers to a use of text analysis to extract and identify subjective information, where it analyzises whether the words are positive or negative. In this section, we will be doing two sentiment analysis, first by identifying positive and negative words using the bing lexicon and after this using the afinn lexicon.
Before doing the sentiment analysis, we will quickly look a the distribution of the review ratings.
summary(tokens_stemmed$reviews.rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 4.515 5.000 5.000
Here, we can see that there is a overepresentation of positive reviews, where the mean is at 4.515 and the median at 5.00. This will contribute to how we do the rest of the sentiment analysis. There is 1134 one-star review rating, 584 two-star review rating, 3028 three-star review rating, 12070 four-star review rating and 30383 five-star review rating.
We wil start with the bing lexicon. The bing lexicon categorizes words in a binary fashion into positive and negative categories. Here, we are using the function get_sentiment to get a specific sentiment lexicon and inner_join to join the lexcon with tokenized data. After this we can count the sentiments.
sentiment_bing <- tokens_stemmed %>% inner_join(get_sentiments("bing"))
sentiment_bing %>% count(sentiment)
Here we can see that there is a lot more positive than negative words, which we also explained earlier was due to the fact that there is a lot more positive then negative reviews.
Now we wil try to plot the sentiments, here grouped by positive and negative sentiments. We are plotting a word count, grouped by sentiment, showing the 10 most frequent negative and positive words.
sentiment_analysis <- sentiment_bing %>% filter(sentiment %in% c("positive"
, "negative"))
word_counts <- sentiment_analysis %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n))
ggplot(word_counts, aes(x = word2, y = n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ sentiment, scales ="free") +
coord_flip() +
labs(title ="Sentiment Word Counts",x ="Words")
Here we can see the positive words are much more frequent than the negative words.
And now we will count all positive and negative words for all five stars reviews.
tokens_stemmed %>% inner_join(get_sentiments("bing")) %>% count(reviews.rating, sentiment)
And here we will show all the words in one table, where we again can see that the positive words are overrepresented compared to the negative words.
bing_word_counts <- tokens_stemmed %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
bing_word_counts
What’s interesting is that there is only 345 unique words being included. In the lexicon, there’s around 6000 unique words, but apparently there must either be a lot of words not in out data not included in the lexicon, or there may be a lot of neutral words.
Now we will find the overall sentiment score for every review rating, taking the positive sentiments and subtracting the negative.
And then we will plot it.
ggplot(tokens_stemmed_bing, aes(x = reviews.rating, y = overall_sentiment, fill = as.factor(reviews.rating))) + geom_col(show.legend = FALSE) + coord_flip() +
labs(title = "Overall Sentiment by Review rating" , subtitle = "Reviews", x = "Review rating"
, y = "Overall Sentiment")
What’s interesting is that rating 5 is very dominant, therefor we will also look at the mean score for all five ratings to see how they distribute. First we will find the mean score for all five rating.
n1 <- tokens_stemmed %>% filter(reviews.rating == 1)
s1 <- tokens_stemmed_bing$overall_sentiment[1] / count(n1)
n2 <- tokens_stemmed %>% filter(reviews.rating == 2)
s2 <- tokens_stemmed_bing$overall_sentiment[2] / count(n2)
n3 <- tokens_stemmed %>% filter(reviews.rating == 3)
s3 <- tokens_stemmed_bing$overall_sentiment[3] / count(n3)
n4 <- tokens_stemmed %>% filter(reviews.rating == 4)
s4 <- tokens_stemmed_bing$overall_sentiment[4] / count(n4)
n5 <- tokens_stemmed %>% filter(reviews.rating == 5)
s5 <- tokens_stemmed_bing$overall_sentiment[5] / count(n5)
Then we will combine the five mean values in a matrix.
x <- c(s1,s2,s3,s4,s5)
And then plot the values.
ggplot(tokens_stemmed_bing, aes(x = reviews.rating, y = x, fill = as.factor(reviews.rating))) + geom_col(show.legend = FALSE) + coord_flip() +
labs(title = "Overall Sentiment by Review rating" , subtitle = "Reviews", x = "Review rating"
, y = "Overall Sentiment")
The scores are generally very low, because there wasn’t a whole lot of words from the lexicon. But here we can see that the 1 score rating is much lower and almost have the same negative size as the five-star rating, which is actually an interesting insight.
Now, we will analyzie the data using the afinn lexicon, which gives every word a score between -5 and 5. We are again using the function get_sentiment to get a specific sentiment lexicon and inner_join to join the lexcon with tokenized data. After this we can summarize the value of each review rating.
And now we can plot it. Th large numbers are even larger compared to the bing lexicon.
sentiment_afinn
And herafter we can again plot it.
ggplot(sentiment_afinn,aes(x = reviews.rating, y = sentiment,
fill = as.factor(reviews.rating))) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title ="Overall Sentiment by Stars",subtitle ="Reviews for Robotic Vacuums",
x ="Stars",y ="Overall Sentiment")
labs(title ="Sentiment Word Counts",x ="Words")
## $x
## [1] "Words"
##
## $title
## [1] "Sentiment Word Counts"
##
## attr(,"class")
## [1] "labels"
What’s interesting here is that all reviews, except one star rating, have a positive score. This could be because positive words have been giving a higher rating overall than negative words. Of course this could also be, because it categorizes negative words as positive words or vice versa, like “it’s amazing how bad it is” may categorize “amazing” (4) as a positive word with a high value even if it used negative.
Again, it could be interesting to look at the mean scores of each rating review, because of the distribution of the review ratings.
n1 <- tokens_stemmed %>% filter(reviews.rating == 1)
s1 <- sentiment_afinn$sentiment[1] / count(n1)
n2 <- tokens_stemmed %>% filter(reviews.rating == 2)
s2 <- sentiment_afinn$sentiment[2] / count(n2)
n3 <- tokens_stemmed %>% filter(reviews.rating == 3)
s3 <- sentiment_afinn$sentiment[3] / count(n3)
n4 <- tokens_stemmed %>% filter(reviews.rating == 4)
s4 <- sentiment_afinn$sentiment[4] / count(n4)
n5 <- tokens_stemmed %>% filter(reviews.rating == 5)
s5 <- sentiment_afinn$sentiment[5] / count(n5)
Then we will combine the five mean values in a matrix.
x <- c(s1,s2,s3,s4,s5)
And then plot the values.
ggplot(tokens_stemmed_bing, aes(x = reviews.rating, y = x, fill = as.factor(reviews.rating))) + geom_col(show.legend = FALSE) + coord_flip() +
labs(title = "Overall Sentiment by Review rating" , subtitle = "Reviews", x = "Review rating"
, y = "Overall Sentiment")
Here, we can see that the positive words have been giving a higher overall value than the negative reviews, which is quiet interesting. So overall, with this dictionary, five star rating have a higher number of positive minus negative words than one start rating have of the same thing. Compared to the same plot from the bing lexicon, where five star and one star was almost the same size, one star is much smaller compared to five stars in this plot.
Latent Semantic Analysis or simply LSA is a techique to identify and analyze the cooccurrences of words across documents. Coorccurrence suggest that the words are somewhat correlated, either by being synonymous or reflect a shared concept. Examples of shared concepts could be colors or cities. We want to extract meanings between documents and words, assuming that words that are close in meaning will appear in similar pieces of texts.
First we create a sparse document-feature matrix from the corpus. Here we turn a tidy one-term-per-document-per-row data frame into a TermDocumentMatrix from the quanteda package.
#Document-feature-matrix
data_dfm = tokens_stemmed %>% count(id, word) %>% cast_dfm(document = id, term = word, value = n)
Here, we have a document-feature matrix with 3306 documents (reviews) and 3542 feautures (words).
Then we fit the LSA scaling model to the dfm, where we set nd, the number of dimensions, to 10.
And now we pull out the feautures and change them to a data frame. We’ll start to look at the features, hence words.
data_lsa_loading <- data_dfm1$features %>%
as.data.frame() %>%
rownames_to_column(var = "word") %>%
as_tibble()
Now we can use the umap function, which stands for Uniform Manifold Approximation and Projection, a technique for dimension reduction. The function computes a manifold approximation and projection.
And then we transform it into a dataframe.
data_lsa_umap %<>% as.data.frame()
And then a transform it into a matrix and assign it a different name. The function hdbscan computes the hierarchical cluster tree, where minPts is the minimum size of the clusters. We are dealing with 3542, so a minimum could be 200 points.
set.seed(123)
data_lsa_hdbscan <- data_lsa_umap %>% as.matrix() %>% hdbscan(minPts = 200)
And now we can plot the features, here in a two dimensional plot. Here it’s clusters after the function clutser for the hbdscan and it assigns a prob for each data point, which is a probability of a data point within its cluster, which runs from 0 to 1.
set.seed(123)
x = data_lsa_umap %>%
bind_cols(cluster = data_lsa_hdbscan$cluster %>% as.factor(),
prob = data_lsa_hdbscan$membership_prob) %>%
ggplot(aes(x = V1, y = V2, col = cluster)) +
geom_point(aes(alpha = prob), shape = 21)
ggplotly(x)
Here, the function in R has reduced the number of dimensions in the data set using the latent features of the data.
And now we run a table to see how many feautures are in each cluster.
table(data_lsa_hdbscan$cluster)
##
## 0 1 2
## 91 212 3240
Here we can see the interactive plot for the features and how they cluster together.
Here, there’s two different clusters and 91 outliers. One of the clusters is quiet big and have 3240 out of the 3542 features. The others are smaller and there’s 91 outliers, which doesn’t have a cluster. There could be lot more clusters, because each cluster should have a minimum of 100 features, but the function only makes two clusters. Unfortunately, we will not go any further regarding which words are in which clusters.
Now, we will move on to analyzing the reviews ans how they cluster.
data_lsa_loading <- data_dfm1$docs %>%
as.data.frame() %>%
rownames_to_column(var = "id") %>%
as_tibble()
Now we can use the umap function, which stands for Uniform Manifold Approximation and Projection, a technique for dimension reduction. The function computes a manifold approximation and projection.
And then we transform it into a dataframe.
data_lsa_umap %<>% as.data.frame()
And then a transform it into a matrix and assign it a different name. The function hdbscan computes the hierarchical cluster tree, where minPts is the minimum size of the clusters. We are dealing with 3542, so a minimum could be 200 points.
set.seed(123)
data_lsa_hdbscan <- data_lsa_umap %>% as.matrix() %>% hdbscan(minPts = 200)
And now we can plot the features, here in a two dimensional plot. Here it’s clusters after the function clutser for the hbdscan and it assigns a prob for each data point, which is a probability of a data point within its cluster, which runs from 0 to 1.
set.seed(123)
x = data_lsa_umap %>%
bind_cols(cluster = data_lsa_hdbscan$cluster %>% as.factor(),
prob = data_lsa_hdbscan$membership_prob) %>%
ggplot(aes(x = V1, y = V2, col = cluster)) +
geom_point(aes(alpha = prob), shape = 21)
ggplotly(x)
And now we run a table to see how many feautures are in each cluster.
table(data_lsa_hdbscan$cluster)
##
## 0 1 2 3 4 5
## 338 616 748 283 352 969
Here the documents cluster differently than the features and they are much more spread out, even when the minimum points are the same. There’s also more outliers compared to the last plot. Here there’s five clusters, which could be the five different rating the documents have clustered after, but it’s latent features.
The topicmodels package requires a document-term matrix as input: By using the function cast_dtm and tidytext we can easily produce it. The matrix have to be term-frequency weighted. We do so using the weightTf function of the TM package for the weighting argument:
data_dtm <- tokens_stemmed %>%
count(id, word) %>%
cast_dtm(document = id, term = word, value = n, weighting = tm::weightTf)
data_dtm
## <<DocumentTermMatrix (documents: 3306, terms: 3543)>>
## Non-/sparse entries: 40576/11672582
## Sparsity : 100%
## Maximal term length: 18
## Weighting : term frequency (tf)
The matrix is rather sparse (Sparsity = 100%). We can try to reduce this by deleting less often used terms.
data_dtm %>% removeSparseTerms(sparse = .99)
## <<DocumentTermMatrix (documents: 3306, terms: 245)>>
## Non-/sparse entries: 26472/783498
## Sparsity : 97%
## Maximal term length: 13
## Weighting : term frequency (tf)
The Sparsity is now 97% which is less than before but still rather sparce. The number of terms went from 3543 to 245. Which is a too high reduction.
data_dtm %>% removeSparseTerms(sparse = .999)
## <<DocumentTermMatrix (documents: 3306, terms: 1224)>>
## Non-/sparse entries: 37198/4009346
## Sparsity : 99%
## Maximal term length: 14
## Weighting : term frequency (tf)
The Sparsity is now 99% which is higher than before. The number of terms is now 1224 (vs. 3543). That’s 1/3 of the ‘original’ terms.
data_dtm %>% removeSparseTerms(sparse = .9999)
## <<DocumentTermMatrix (documents: 3306, terms: 3543)>>
## Non-/sparse entries: 40576/11672582
## Sparsity : 100%
## Maximal term length: 18
## Weighting : term frequency (tf)
The results above is just the exact the same as before we tried to remove the sparse terms. It doesn’t seems like it’s worth to try to reduce the sparsity vs. the reduction of the terms. Therefore we’ll just accept a high level of sparsity (100%) to keep all of the terms.
Next we perfome a LDA. We’re using the “Gibbs” sampling as method.
Above the top 10 terms in each LDA topic are displayed. We choose the number of two clusters since choosing a higher number results in the same words displayed in two or more clusters.
It seems like cluster 2 contains some words with a tecnological character (echo, screen, app, alexa, device) while cluster 1 seems related to books/reading (read, book, kindle) and positive words (love, easy).
Introduction: What do we want to predict.
We split the data in a train and test dataset, using the proportions 3/4, which is standard in the “initial_split” command.
data_split <- data_clean %>%
select(id) %>%
initial_split()
train_data <- training(data_split)
test_data <- testing(data_split)
Maybe we want to filter away infrequent words here to get fewer features. Transforming training data to a sparse Matrix, so a given cell is either empty or indicating the frequency it occurs by in a given document.
sparse_words <- tokens_clean %>%
count(id, word) %>%
inner_join(train_data) %>%
cast_sparse(id, word, n)
dim(sparse_words)
## [1] 3750 3677
In our training set, we have 3750 observertions and the 3677 features, which is the different tokens. We could also cbind other columns such as numeric data into the matrix here, such as sentiment numbers.
We now build a dataframe with the response variable for the ratings.
word_rownames <- as.integer(rownames(sparse_words))
data_joined <- data_frame(id = word_rownames) %>%
left_join(data_clean %>% select(id, reviews.rating))
rating_equal_5 <- data_joined$reviews.rating == "5"
model <-cv.glmnet(sparse_words, rating_equal_5, family="binomial",
parallel = TRUE, keep = TRUE)
plot(model)
plot(model$glmnet.fit)
coefs <- model$glmnet.fit %>%
tidy() %>%
filter(lambda == model$lambda.1se)
coefs %>%
group_by(estimate > 0) %>%
top_n(10, abs(estimate)) %>%
ungroup() %>%
ggplot(aes(fct_reorder(term, estimate),estimate, fill =estimate > 0)) +
geom_col(alpha = 0.8, show.legend = FALSE) +
coord_flip() +
labs(x = NULL, title = "Coefficients that increase/decrease probability the most")
This means that “terrible” and “demonstrate” are unlikely to be written in a review.
intercept <- coefs %>%
filter(term == "(Intercept)") %>%
pull(estimate)
classifications <- tokens_clean %>%
inner_join(test_data) %>%
inner_join(coefs, by = c("word" = "term")) %>%
group_by(id) %>%
summarize(score = sum(estimate)) %>%
mutate(probability = plogis(intercept + score))
hist(classifications$probability)
#comment_classes <- classifications %>%
# left_join(data_clean %>% select(reviews.rating, id),by = "id") %>%
# mutate(reviews.rating = as.factor(reviews.rating))
#
#comment_classes %>%
# roc_curve(as.factor(reviews.rating=="5"), probability) %>%
# ggplot(aes(x = 1 - sensitivity , y = sensitivity)) +
# geom_line(
# color = "midnightblue",
# size = 1.5) +
# geom_abline(lty = 2, alpha = 0.5,
# color = "gray50",
# size = 1.2) +
# labs(
# title = "ROC curve for text classification using regularized regression",
# subtitle = "Predicting whether text was written by Jane Austen or H.G. Wells"
# )
#for at jeg kunne printe den, har jeg lige markeret med hastag
#comment_classes %>%
#roc_auc(as.factor(reviews.rating=="5"),probability)